Neural Model for Content Extraction in Multilingual Web Documents

Authors

  • Kolla Bhanu Prakash
  • M. A. Dorai Rangaswamy
  • Arun Raja Raman
  • Rafael C. Gonzalez
  • Richard E. Woods
  • Steven L. Eddins
  • Bing Zhao
  • Stephen Vogel
Abstract

Neural models for multilingual web documents in the Indian subcontinent are gaining prominence in day-to-day life. While translation and transliteration are becoming more important on web pages, it remains difficult for the common user to understand what a web page is about, especially when the user does not know the regional language. Our effort here is therefore a generic tool, based on neural networks, to overcome this problem. The model takes inputs in both English and Telugu, an Indian regional language, in both printed and handwritten formats. Words having common content are chosen, and a neural network is used to normalize the output. A sample page from a physics textbook dealing with magnetism is considered in this paper.

Similar resources

Automatic metadata mining from multilingual enterprise content

Personalization is increasingly vital especially for enterprises to be able to reach their customers. The key challenge in supporting personalization is the need for rich metadata, such as metadata about structural relationships, subject/concept relations between documents and cognitive metadata about documents (e.g. difficulty of a document). Manual annotation of large knowledge bases with suc...

Multilingual extraction and editing of concept strings for the legal domain

Identifying semantic expressions, so-called concept strings (CSs), in multilingual corpora is an important NLP task, as it allows web search engines to define and perform semantic queries over large collections of documents. Existing web search engines in the legal domain are mainly limited to keyword search, in which the query word is matched against the textual content of the documents. This p...

Discovering Parallel Text from the World Wide Web

A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...

Trillions of Comparable Documents

We propose a novel multilingual Web crawler and sentence mining system to continuously mine and extract parallel sentences from trillions of websites, unconstrained by domain, URL structure, or publication date. The system is divided into three main modules: a Web crawler, comparable and parallel website matching, and parallel sentence extraction. Previous methods in mining parallel sen...

Cross-Language Hybrid Keyword and Semantic Search

The growth of multilingual web content and increasing internationalization portends the need for cross-language information retrieval. As a solution to this problem for narrow-domain, data-rich web content, we offer ML-HyKSS: MultiLingual Hybrid Keyword and Semantic Search. The key component of ML-HyKSS is a collection of linguistically grounded conceptual-model instances called extraction onto...

Publication date: 2013